                        I. INTRODUCTION


  The National Bureau of Standards and the United States Patent Office
are actively collaborating in a long-range program to develop and
apply automatic techniques of information storage and retrieval to
problems of patent search. An important preliminary phase of this
program has been the carrying out of experiments with methods for
locating information in large files of technological and scientific
information.

  In the granting of United States patents it is necessary for patent
examiners to refer to collections that may, in principle, contain from
10⁶ to 10⁷ documents. In effect, when an examiner conducts a
literature search to determine whether a patent application represents
a novel idea which then must be tested against established criteria
for patentability, he must first assure himself, insofar as possible,
that he has exhaustively searched through all literature in the public
domain that might possibly contain any information pertinent to the
given application. It has been estimated that 60 percent of the time
spent by an examiner in processing a patent application is devoted to
searching the technical literature. 1/ In an attempt to reduce this
expenditure of time, the National Bureau of Standards-Patent Office
group has considered, among other techniques, the use of automatic
data processing systems.

  By an automatic data processing system (ADPS) is meant a collection
of machines, usually but not necessarily electronic in nature, which
has the ability to process information in accordance with internally
stored programs and which can perform a whole data
processing task involving the use of data storage facilities of
diverse natures without the necessity of manual intervention. The
system also includes devices for the preparation of input data and the
reproduction of output data. SEAC, the NBS Electronic Automatic
Computer, is an ADPS and has been used in successful preliminary
experiments wherein a collection of over 200 descriptions of steroid
compounds is exhaustively searched to answer typical questions that
may occur in evaluating patent applications for new chemical
compounds. This report describes some theoretical ideas on the use of
ADPS for literature searching that have resulted from these
experiments in searching through chemical information.

-----------------------
¹ All references will be found at the end of the report.

  In considering any attempt to automatize the searching of technical
literature in the United States Patent Office, it must be remembered
that the historical nonautomatic or manual method of searching which
is presently in effect at the Patent Office utilizes the best
intellectual efforts of some 800 examiners highly trained in diverse
technologies and the legal aspects of isolating significant
information from large technical files. Consequently, there is no a
priori reason to believe that there is any single overall solution to
the literature search problem in the Patent Office which will function
as effectively as this trained corps of examiners. With this
consideration in mind, one area of the inventive arts was selected for
initial experimental investigation in the hope that empirical
solutions in that area could be put into productive operation. A bonus
has been the development of theoretical and experimental techniques
which should prove applicable to the searching of technical literature
in other areas of the Patent Office.

  The area that was selected for initial experimental investigation
was that of "Composition of Matter," i.e., patents generally
concerned with what may loosely be classified as chemistry. Chemists
have for a long time been concerned with information retrieval, and it
was hoped that use could be made of some of the techniques that the
chemists have developed. As it turned out, the experimental results
obtained took advantage of a technique of chemistry that was,
historically, probably not developed for the purpose of information
retrieval; namely the use of chemical structure diagrams for
describing the chemical nature of matter.

  It is of paramount importance to realize that in using automatic
techniques for the retrieval of technical information, no more
information can be obtained from a file than that information which is
represented according to a well-defined consistent notational scheme.
Because the method of representing chemical structures in diagrammatic
form has just such properties, it was decided to experiment with the
use of SEAC for searching through files of chemical structure diagrams
in response to search requests fed into the machine.


II. THE USE OF ADPS FOR SEARCHING CHEMICAL STRUCTURES 


The Structure Search Problem
----------------------------

In the Patent Office the examiners in the chemical arts have frequent
need for performing so-called generic searches through structure
diagrams. As an example, it can be seen that the fragment structure
which is given in Figure 1 is contained in the two compounds brucine
and codeine shown in Figure 2. Unmarked vertices in the structure are
understood to represent carbon atoms (C). Notice that
the six-member ring with the nitrogen (N) occurs in codeine even
though the diagram of codeine indicates the ring in a distorted
manner. We can say that the two compounds in Figure 2 share the
generic property of containing this fragment. Part of the experiments
performed on SEAC were concerned with developing a method for
performing generic searches of this type through a file of structures
taken from the art of steroid chemistry.


The Structure Search Routine Used on SEAC
-----------------------------------------

  The Patent Office search requires an unambiguous coding system in
which any combination of atoms and bonds can be represented for
purposes of retrieval. The traditional coding systems 2,3/ are
unsuitable for mechanized search because a given compound can be
represented in conceptually different ways or because the system is so
complex that it can be used only by a trained chemist. Opler 4/ of the
Dow Chemical Company has developed a code for use in machine
searching. This code is flexible, but it is unsuitable for Patent
Office searches because it does not represent the most fundamental
units of the chemical structure, the atoms and their bonds which are
directly required in many typical patent searches.


  In the system used on SEAC each atom in a structural diagram is
numbered serially in arbitrary order. One unit of computer storage,
called a word, is given to each atom to represent its position in the
structure. In each word are listed the numbers of the other atoms, up
to four, that are attached to the atom represented by the word. The
element symbol and the serial number of the atom are also placed in
the word. Thus each atom word has six fields: the serial number of the
atom, four connection fields, and an element symbol.


  As an illustration, consider the compound, chloral, shown in Figure
3. The coding would proceed as follows:

           Cl   H
           |    |
     Cl -- C -- C = O
           |
           Cl

      Chloral
      
      Figure 3

             5   4
           Cl   H
       7  6|    |3 2 1
     Cl -- C -- C = O
          8|
           Cl

      Chloral
      
      Figure 4
 


a. First number the atoms and all bonds other than single bonds in any
arbitrary order as shown in Figure 4.


b. Then set down a list of the connections to each component of the
structure, as shown below:


       Component No.       Connections 
       -------------       -----------
            1                 2
            2                 1-3
            3                 2-4-6
            4                 3
            5                 6
            6                 3-7-8-5
            7                 6
            8                 6

c. Finally, put in each word the element symbol of the atom it
represents as follows:


      Component No.      Connections    Element Symbol
      -------------      -----------    --------------
            1             2                   O
            2             1-3                 =
            3             2-4-6               C
            4             3                   H
            5             6                   Cl
            6             3-7-8-5             C
            7             6                   Cl
            8             6                   Cl


The list represents the complete code for the structure. The structure
is easily drawn using the code. The code for any structure is not
unique since by numbering the atoms in some other arbitrary order a
different code would be obtained. It is easily seen, however, that all
of the possible codes are equivalent.
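
  The coding steps above can be sketched in a present-day programming
language. The Python fragment below is an illustration only and is not
the SEAC program; the names Word, is_symmetric and renumber are
assumptions introduced here. It holds the chloral code of Figure 4, with
the double bond stored as component 2, checks that every connection is
recorded from both ends, and shows that renumbering the components gives
a different but equivalent code.

```python
from typing import NamedTuple, Tuple

class Word(NamedTuple):
    """One component word: serial number, up to four connections, symbol."""
    serial: int
    connections: Tuple[int, ...]   # at most four entries
    symbol: str                    # element symbol, or "=" for a double bond

# Chloral, numbered as in Figure 4.
CHLORAL = [
    Word(1, (2,), "O"),
    Word(2, (1, 3), "="),
    Word(3, (2, 4, 6), "C"),
    Word(4, (3,), "H"),
    Word(5, (6,), "Cl"),
    Word(6, (3, 7, 8, 5), "C"),
    Word(7, (6,), "Cl"),
    Word(8, (6,), "Cl"),
]

def is_symmetric(code):
    """True if every listed connection is also listed from the other end."""
    table = {w.serial: set(w.connections) for w in code}
    return all(w.serial in table.get(c, ()) for w in code for c in w.connections)

def renumber(code, mapping):
    """Return the equivalent code obtained under another arbitrary numbering."""
    return sorted(
        Word(mapping[w.serial], tuple(mapping[c] for c in w.connections), w.symbol)
        for w in code
    )
```

  Calling renumber with any one-to-one mapping of the numbers 1 to 8
yields a list that differs from CHLORAL word for word yet still passes
is_symmetric and carries the same element symbols.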

  It is desired to search a file of coded structures for all
structures which are identical to some question compound or which have
some generic property in the sense previously defined. To do this, the
SEAC search program tries to make an atom-to-atom match between the
atoms of the question structure and the atoms of the first structure
recorded in the file. Each match that is made is considered as
tentative by the program until the search through the first file
structure is completed. Whenever failure to match is discovered by the
program it tries to go back to the previous match to make a new match.
If the program finds that all possible first matches lead to
irreconcilable mismatches, the program will reject the first file
structure and proceed on to the next. When a one-to-one correspondence
exists between each of the atoms of the question and the atoms of part
of the file structure that is being examined, the routine accepts the
structure by printing on the computer output an indication of which
structure was found. The search routine continues this process until
the whole file has been searched.
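
  The matching procedure just described is, in present-day terms, a
backtracking search. The Python sketch below is a minimal illustration
under stated assumptions and does not reproduce the SEAC routine or its
flow chart: the function names, the dictionary layout, and the
formaldehyde example are all introduced here for illustration.

```python
# Each structure maps a component number to (symbol, connections).
CHLORAL = {
    1: ("O", (2,)), 2: ("=", (1, 3)), 3: ("C", (2, 4, 6)), 4: ("H", (3,)),
    5: ("Cl", (6,)), 6: ("C", (3, 7, 8, 5)), 7: ("Cl", (6,)), 8: ("Cl", (6,)),
}
FORMALDEHYDE = {
    1: ("O", (2,)), 2: ("=", (1, 3)), 3: ("C", (2, 4, 5)),
    4: ("H", (3,)), 5: ("H", (3,)),
}
FILE = {"chloral": CHLORAL, "formaldehyde": FORMALDEHYDE}

# Question fragments: a carbonyl group, and a carbon carrying three chlorines.
CARBONYL = {1: ("C", (2,)), 2: ("=", (1, 3)), 3: ("O", (2,))}
TRICHLORO = {1: ("C", (2, 3, 4)), 2: ("Cl", (1,)), 3: ("Cl", (1,)), 4: ("Cl", (1,))}

def contains(question, structure):
    """Return True if `question` matches some part of `structure`."""
    q_order = list(question)

    def consistent(q, f, match):
        q_sym, q_conn = question[q]
        f_sym, f_conn = structure[f]
        if q_sym != f_sym or f in match.values():
            return False
        # Every already-matched neighbour of q must map to a neighbour of f.
        return all(match[n] in f_conn for n in q_conn if n in match)

    def extend(i, match):
        if i == len(q_order):
            return True                  # one-to-one correspondence found
        q = q_order[i]
        for f in structure:              # tentative match of q with f
            if consistent(q, f, match):
                match[q] = f
                if extend(i + 1, match):
                    return True
                del match[q]             # go back and try a new match
        return False

    return extend(0, {})

def search(question, file):
    """Return the names of all file structures that contain the question."""
    return [name for name, s in file.items() if contains(question, s)]
```

  With these data, search(CARBONYL, FILE) accepts both structures, while
search(TRICHLORO, FILE) accepts only chloral. In the latter case the
first tentative match of the question carbon falls on the carbonyl
carbon of chloral and must be withdrawn before the correct carbon is
found.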


  The details of the search routine are given in the flow chart in
Figure 5. The symbols used in the chart are defined as follows: 

   Figure 5. Flow Chart of the Chemical-Structure Search Routine


I, II, III, IV denote the four connection fields in each atom word.

Qᵢ             is a question atom word.

Fb             is a file atom word.

Nᵢ             is a temporary storage location for question atom
               words matched with corresponding file atom words
               (R).

Rᵢ,Rⱼ          are temporary storage locations for file atom words
               matched with corresponding question atom words (N).

α              denotes fields of R. It can equal I, II, III, or IV.

β              denotes fields of N. It can equal I, II, III, or IV.


The Use of Screens for Speeding up the Search 
---------------------------------------------

  Fast as a high-speed electronic computer is, the fact remains that
performing a detailed search of the type described would be altogether
too time-consuming unless some short-cuts could be devised which would
in no way compromise the exhaustiveness or accuracy of the search,
while speeding up the process greatly. A technique is needed that will
enable the ADPS to perform what is for the machine a cursory
inspection of a small piece of data in such a manner that most
structures that will not satisfy the search requirement will be
rejected immediately. Such a technique is called a "screen" or a
"screening device." It is essential that a screening device should
never cause a structure to be rejected which does, in fact, meet the
search requirement. It is acceptable, however, if the screen allows
some structures to be considered further by the structure search
routine even though they are subsequently rejected as failing to meet
the search requirement.

  By now, it will have become obvious to any chemist that one such
useful screen is inherent in the empirical formula of a chemical
structure. In other words, by storing in the file, along with the
description of the chemical structure, a list of the number of
occurrences of each type of atom in the structure, the ADPS can
inspect this list before it searches the structure to find out whether
there are enough atoms of the right type present to satisfy the search
requirement. This screen was incorporated
into the SEAC search program and on most searches enabled the computer
to reject quickly the vast majority of structures that would otherwise
have been rejected only after a long computational procedure.
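
  The empirical-formula screen can be sketched as follows. This Python
fragment is an illustration, not the SEAC code; the function names are
assumptions. The screen is safe in the sense required above because a
fragment contained in a structure can never have more atoms of any
element than the structure itself.

```python
from collections import Counter

def formula(symbols):
    """Count the occurrences of each element symbol (the empirical formula)."""
    return Counter(symbols)

def passes_screen(question_formula, file_formula):
    """False only when the file structure lacks atoms the question requires."""
    return all(file_formula[el] >= n for el, n in question_formula.items())

# Chloral has the empirical formula C2 H Cl3 O.
chloral = formula(["O", "C", "H", "Cl", "C", "Cl", "Cl"])
```

  A question needing one carbon and one oxygen passes this screen for
chloral and goes on to the detailed search; a question needing a
nitrogen atom, or four chlorine atoms, is rejected at once.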


  Other screens that were considered but not experimented with would
make use of topological properties of the chemical structures. One
such simple screen specifies the number of rings within the structure.
Obviously, if one were searching for some generic structure having,
say, three benzene rings, a structure having even 500 atoms in it
could not contain the one being searched for unless there were at
least 3 rings somewhere within the 500-atom structure.

  Other topological properties that could serve as screens would be
the longest non-self-intersecting path that could be traced through
the structure and a count of the number of atoms with each of the
possible valences. However, these topological properties were not
experimentally investigated.
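
  Although the ring screen was not tried on SEAC, a sketch may make the
idea concrete. The Python fragment below is hypothetical throughout. It
assumes that "number of rings" is taken to mean the number of
independent rings, which for a connection table equals the number of
connections, less the number of components, plus the number of
connected pieces. That count can never be larger for a fragment than
for a structure containing it, so the screen rejects no true answer.

```python
def ring_count(table):
    """Independent rings in a structure: connections - components + pieces.

    `table` maps each component number to the numbers it is connected to.
    """
    connections = sum(len(conn) for conn in table.values()) // 2
    seen, pieces = set(), 0
    for start in table:
        if start in seen:
            continue
        pieces += 1                      # a new connected piece
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(table[node])
    return connections - len(table) + pieces

def passes_ring_screen(question, structure):
    """A fragment cannot have more independent rings than its host."""
    return ring_count(structure) >= ring_count(question)

# Chloral (no rings) and a bare six-member ring, as connection tables.
CHLORAL = {1: (2,), 2: (1, 3), 3: (2, 4, 6), 4: (3,),
           5: (6,), 6: (3, 7, 8, 5), 7: (6,), 8: (6,)}
SIX_RING = {i: ((i - 2) % 6 + 1, i % 6 + 1) for i in range(1, 7)}
```

  A question containing the six-member ring is screened out against
chloral without any atom-by-atom matching. The reverse case passes the
screen and is left to the detailed search to reject.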

  The important properties to be sought for in devising screens for
any type of searching, chemical or otherwise, are that the screens be
substantially independent of each other and that they have universal
applicability to all documents in a collection. These properties are
never completely achieved in practice, but can be approached when the
search is limited to narrowly defined subject matter. The criteria for
evaluating screening devices are, in other words, the same as those
used for determining the value of questions in the familiar game of
"20 Questions." The important difference between the parlor game and
its ADPS search equivalent is that for the latter all 20 questions
must be completely stated before any one of them can be answered.

Grouping of the Information for Searching
-----------------------------------------

  In deciding upon a method for representation of structure diagrams
for search by the ADPS, it was necessary to decide what data about a
structure diagram would be represented in the code as it was stored
in the machine file. Certainly an atom-by-atom description would give
a completely general and flexible method for representing chemical
information. The difficulty arose in certain of the chemical arts where
the distinction between one chemical and the next would be very slight
in the atom-by-atom description. On the other hand, very useful
discrimination among the members of such a class could be made by
including in
the description certain characteristics that were peculiar to that
particular class. The problem here is one that is undoubtedly a
characteristic of systems for coding information wherever the
information assumes diverse forms; namely, how much generality shall
be sacrificed in order to provide a high degree of discrimination in
certain isolated areas of the collection where most of the documents
contain similar information.

  A solution to the problem that is practical if ADPS searching is to
be used, but which is probably not practical for use with more simple
systems (e.g., punched cards) is to have special coding systems
applicable for certain documents in a collection. These coding systems
serve as adjuncts to one (or more) coding system which describes the
entire collection. If in searching through a collection, the screening
devices fail to reject a document for which a special coding system is
applicable, then special instructions are automatically made available
to the ADPS to enable it to search the special document in terms of
the special coding system being used. It is quite practicable, when
searching with an ADPS, to have documents scattered through a
collection that have been described with special coding systems not
generally used for the whole collection. This technique becomes costly
of machine searching time only when the number of such special
documents is sufficiently large that it becomes necessary for the ADPS
frequently to call up its auxiliary instructions to handle the special
coding situation.

  With these considerations in mind, some special coding methods
suggested by the Patent Office were used for coding structures in the
steroid chemical art. These coding rules enabled fine distinctions to
be made between various compounds in the steroid art although the
coding rules depended upon certain chemical structure properties which
were quite peculiar to the steroid art and, therefore, inapplicable
for searching other types of chemical structures.

III. USE OF ADPS AS PART OF AN INFORMATION RETRIEVAL SYSTEM 
          FOR PURPOSES OTHER THAN SEARCHING 


  The experiments and theoretical considerations thus far described
have been concerned with the use of the ADPS as a tool for actually
performing a search through a file of information (in this case,
chemical
in nature). However, the use of ADPS in retrieval of information
extends considerably beyond the activity of actual searching. This 
section describes the use of ADPS for processing data as part of
an overall information retrieval system.


Use of ADPS for Checking Data
-----------------------------

  It is evident that the file of information through which a search
will be performed must be prepared without error since any error in a
recorded piece of information represents the loss of some information
from the file. In the experiments with retrieval of chemical
structures, the original file was prepared by the Patent Office in the
form of about 2,500 punched cards that described about 250 chemical
structures. Since one of the features of the coding scheme used was
its lack of reliance upon chemical knowledge for encoding structures,
a group of punched card typists were given the set of 250 pictures
describing the structures to be encoded. The operators read the
pictures and punched the descriptions on cards without the
intervention of any supervision from a chemist. As was to be expected,
there were some cards out of the 2,500 that contained errors, and the
problem was to find the errors and correct them.

  Since the data were ultimately to be used as input to SEAC, they
were first transcribed from punched cards onto magnetic wire (which
was the principal SEAC input-output medium at the time of the tests
described). Then a program was written for SEAC to take these data
from the wire and check them. It is important to note that this data
checking program was entirely unrelated to the subsequent SEAC search
program. The data were checked for internal consistency and for their
adherence to the coding rules that had been established. The result
was to catch about 50 punched card errors and to produce a copy of
those parts of the file containing no errors. This expurgated file
could then have been used by another machine. If the coding system had
been of suitable nature, other simpler mechanisms could have done the
searching through data that had been checked by ADPS.
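
  The kind of consistency checking described can be illustrated with a
short sketch. The actual rules applied by the SEAC checking program are
not listed in this report; the rules in the Python fragment below (known
symbols, at most four connections, a double bond joining exactly two
components, every connection recorded from both ends) are assumptions
inferred from the coding scheme, and the function name is hypothetical.

```python
def check_structure(words, allowed=("C", "H", "O", "N", "Cl", "=")):
    """Return a list of error messages for one coded structure.

    `words` is a list of (serial, connections, symbol) tuples.
    """
    errors = []
    serials = [s for s, _, _ in words]
    if len(set(serials)) != len(serials):
        errors.append("duplicate component number")
    table = {s: conn for s, conn, _ in words}
    for serial, conn, symbol in words:
        if symbol not in allowed:
            errors.append(f"{serial}: unknown symbol {symbol!r}")
        if len(conn) > 4:
            errors.append(f"{serial}: more than four connections")
        if symbol == "=" and len(conn) != 2:
            errors.append(f"{serial}: double bond must join two components")
        for c in conn:
            if c not in table:
                errors.append(f"{serial}: connection to missing component {c}")
            elif serial not in table[c]:
                errors.append(f"{serial}: connection to {c} not reciprocated")
    return errors

GOOD = [(1, (2,), "O"), (2, (1, 3), "="), (3, (2, 4, 6), "C"), (4, (3,), "H"),
        (5, (6,), "Cl"), (6, (3, 7, 8, 5), "C"), (7, (6,), "Cl"), (8, (6,), "Cl")]
# The same structure with component 7 mispunched as 1.
BAD = [w if w[0] != 7 else (1, (6,), "Cl") for w in GOOD]
```

  For GOOD the list of errors is empty. For BAD the check reports the
duplicated component number and the connections that no longer agree,
which is enough to send the card back for correction.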

  Here, then, is an example of the use of an ADPS as part of an
information retrieval system although the ADPS need not serve as the
searching device. In preparing a large file for occasional use, where
the cost of a large ADPS would not justify its use as a searching
tool, the ADPS might still profitably serve the function of checking
the initial data to be entered in the system.


Use of ADPS for Transliteration
-------------------------------

Another function that an ADPS can serve as part of an information
retrieval system is that of transliterating data to be entered into
the system from the forms that are most convenient for manual
preparation of the data into forms more suitable for searching by
machine. In the structure search experiment, the data to be used by
SEAC had to be arranged in a certain format for most efficient
utilization of the SEAC memory space. This format was sufficiently
complicated that the punched card operators could not be expected to
prepare the data in that format with any reasonable speed or accuracy.
Consequently, a program was written for SEAC which accepted data in a
format convenient for the punched card operators and transliterated
the data into the form desirable for SEAC searching.

  Again we have an example of the use of the ADPS, not for searching,
but in this case for transliterating data from one form into another.
It should be noted that both the input data for this transliteration
program and the output produced by the machine followed completely
rigorous rules of organization and arrangement. The ADPS was
converting data from one well-defined coded form into another. The
fact that this can be done readily with an ADPS should not lead one to
the conclusion that data expressed in some natural language (e.g.,
English) can be translated by machine into a coded form suitable for
machine search. The problem of machine translation is a formidable one
to which much effort is being devoted 5/ both in the United States and
elsewhere. What is claimed here is that the comparatively simple
problem of transliterating from one code into another can be
conveniently handled by an ADPS.


  Another example of transliteration by SEAC which is the subject of
some current experiments is its use to generate chemical structure
descriptors for use by a more simple searching machine. It has been
suggested by Mooers 6/ that for purposes of retrieval, complex
structures like chemical diagrams can be represented in terms of a
list of, say, all the triples of atoms and bonds occurring within the
structure. Thus, chloral (Figure 3) would be described as consisting
of combinations of the triples listed on the following page.

                              Cl - C
                               C - C
                               - C -
                               - C =
                               C - H
                               C = O


  A simple search mechanism performing simple comparisons between
similar types of triples in the search request and the file could
retrieve complex structures without the necessity of doing the complex
processing of data of the type performed by the structure search
routine described earlier. It is not within the scope of this report
to discuss the merits of such a search system. However, this example
is mentioned to demonstrate that an ADPS can generate the N-tuple
descriptors from a complete representation of the structure. If a
large file of chemical structures is to be retrieved by a very simple
mechanism using Mooers' N-tuple descriptors, the original file may be
prepared in, for example, the form required for the SEAC structure
search routine. It is then a fairly straightforward job to program
SEAC to generate the N-tuple descriptors and consequently to produce
as output a transliterated file all prepared for searching by more
simple mechanisms.
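
  The generation of descriptors from a complete representation can be
sketched as follows. This Python fragment is an illustration only and
is not the SEAC program under development; it is limited to triples, and
the function name and the atom-and-bond layout are assumptions. Each
triple is reduced to a single spelling, so that C - Cl and Cl - C count
as the same descriptor.

```python
from itertools import combinations

def triples(atoms, bonds):
    """Return the set of distinct triples occurring in a structure.

    atoms: {number: element symbol}; bonds: [(number, number, "-" or "=")].
    """
    def canon(t):
        return min(t, t[::-1])                        # one spelling per triple

    found = set()
    at_atom = {n: [] for n in atoms}
    for a, b, order in bonds:
        found.add(canon((atoms[a], order, atoms[b])))  # atom-bond-atom
        at_atom[a].append(order)
        at_atom[b].append(order)
    for n, orders in at_atom.items():
        for x, y in combinations(orders, 2):           # bond-atom-bond
            found.add(canon((x, atoms[n], y)))
    return found

# Chloral, with single bonds written out explicitly.
ATOMS = {1: "O", 3: "C", 4: "H", 5: "Cl", 6: "C", 7: "Cl", 8: "Cl"}
BONDS = [(1, 3, "="), (3, 4, "-"), (3, 6, "-"),
         (6, 5, "-"), (6, 7, "-"), (6, 8, "-")]
```

  For chloral this yields six distinct triples, corresponding to the six
listed above. A simple mechanism could then accept a file structure
whenever the set of question triples is contained in the set of file
triples.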


  Some present SEAC experiments are devoted to generation of the
N-tuple descriptors from the file of steroid compounds previously
described. It is intended, then, to run comparative tests on a file of
chemical structures coded according to the two coding schemes. Any
results obtained from such a comparison will be less significant,
however, than the fact that such an experiment demonstrates the
feasibility of simulating on a large ADPS the searching procedure of a
more simple mechanism for purposes of comparative evaluation.


Use of ADPS for Exploring Complex Logical Situations
----------------------------------------------------

  In retrieval of information from Patent Office files, it often
occurs that complex logical conditions must be imposed upon either the
question or the file. A simple example of the way in which an ADPS can
handle complex inter-relationships occurs in the searching of
so-called alternative patent disclosures.

  It is quite common for an inventor to describe his invention in the
following form: A plus B plus any one of the following three, C, D, or
E, plus any one of the following three, F, G, or H. The elements C,
D, E are alternatives for each other, as are the elements F, G, H, and
any one of the first group may possibly be combined with any one of
the second group for purposes of anticipating a subsequent patent
claim. However, if someone were to claim the combination of A+B+C+D,
the patent described above would not be relevant since C and D are
disclosed only as alternatives for each other. It is desirable to be
able to use a search machine that will not accept a patent like the
first one described when searching for the latter.


  One solution to this problem that has been proposed but does not
appear practical is to code the alternative type of disclosure
separately as each of the several types being described. Thus A+B+one
of (C, D, E) +one of (F, G, H) would be represented by 3x3=9 separate
entries. In many real situations a number in the thousands would
describe the number of possible combinations claimed in a chemical
patent.
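
  The growth in the number of separate entries is easily made concrete.
The Python sketch below is illustrative only, and the function names are
assumptions. It lists the explicit combinations covered by an
alternative disclosure and counts them without listing them.

```python
from itertools import product
from math import prod

def expand(fixed, groups):
    """List every explicit combination covered by an alternative disclosure."""
    return [tuple(fixed) + choice for choice in product(*groups)]

def count(groups):
    """Number of separate entries needed, without listing them."""
    return prod(len(g) for g in groups)

entries = expand(["A", "B"], [["C", "D", "E"], ["F", "G", "H"]])
```

  Here entries holds the 3x3=9 combinations, among them A+B+C+F but not
A+B+C+D. Four groups of ten alternatives each would already require
10,000 entries.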

  Another method for handling this type of situation on SEAC was
suggested by the Patent Office and made use of the structure search
routine. By the introduction of certain dummy elements into the
chemical structure, several alternative structures could be coded in
one large pseudo-structure. If the alternative structure of Figure 6
is to be coded, it can be represented as shown in Figure 7, where (X)
represents a dummy element. It can be seen that Figure 8 (a) is
contained in the pseudo-structure of Figure 7, but not Figure 8 (b).

       One of (C, D, E)
      /
   A-B
      \
       One of (F, G, H)

   Fig. 6


              C
             /
            /
         (X)---D
         /  \
        /    \
       /      E
      /
   A-B
      \
       \      F
        \    /
         \  /
         (X)---G
            \
             \
              H

   Fig. 7


                            (X)---D
                           /
                          /
                       A-B
     A-B-(X)-C            \
                           \
                            (X)---C

       (a)                   (b)


              Figure 8
              
  The handling of this complex type of alternative search is possible
on an ADPS but difficult for a more simple mechanism. In a large
information retrieval system it may be possible to use more simple
mechanisms than an ADPS for searching until a complex logical
situation like the alternative search arises, in which case the file
may be made available to the ADPS for more complete searching.
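
  The behavior of the dummy element can be checked with a small sketch.
The Python fragment below is a hypothetical illustration, not the SEAC
routine: it codes the pseudo-structure of Figure 7 and the two questions
of Figure 8 as labeled graphs, with the dummy element given the symbol
X, and applies a simple backtracking match. All names are assumptions.

```python
def build(symbols, edges):
    """Turn an edge list into {node: (symbol, set of neighbours)}."""
    graph = {n: (s, set()) for n, s in symbols.items()}
    for a, b in edges:
        graph[a][1].add(b)
        graph[b][1].add(a)
    return graph

def contains(question, structure, match=None):
    """Backtracking test that `question` occurs within `structure`."""
    match = {} if match is None else match
    todo = [q for q in question if q not in match]
    if not todo:
        return True
    q = todo[0]
    q_sym, q_nbrs = question[q]
    for f, (f_sym, f_nbrs) in structure.items():
        if f_sym != q_sym or f in match.values():
            continue
        if all(match[n] in f_nbrs for n in q_nbrs if n in match):
            match[q] = f                 # tentative match
            if contains(question, structure, match):
                return True
            del match[q]                 # withdraw it and try another
    return False

FIG7 = build(
    {"a": "A", "b": "B", "x1": "X", "x2": "X", "c": "C", "d": "D",
     "e": "E", "f": "F", "g": "G", "h": "H"},
    [("a", "b"), ("b", "x1"), ("b", "x2"), ("x1", "c"), ("x1", "d"),
     ("x1", "e"), ("x2", "f"), ("x2", "g"), ("x2", "h")],
)
FIG8A = build({1: "A", 2: "B", 3: "X", 4: "C"}, [(1, 2), (2, 3), (3, 4)])
FIG8B = build({1: "A", 2: "B", 3: "X", 4: "D", 5: "X", 6: "C"},
              [(1, 2), (2, 3), (3, 4), (2, 5), (5, 6)])
```

  The call contains(FIG8A, FIG7) succeeds, while contains(FIG8B, FIG7)
fails: C and D hang from the same dummy element in Figure 7, so they
cannot be reached through two different dummy elements as Figure 8 (b)
requires.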


  There are other complex logical situations that arise in Patent
Office searching for which it is not yet possible to announce
experimental solutions. One particularly difficult one occurs when
there is a reference that is complete except for a minor substitution
of some component A for the desired component B and when in an
entirely separate patent there is a statement attesting to the
equivalence of A and B for the function concerned. It is often
desirable to retrieve such a partly incomplete reference in
conjunction with the reference stating equivalence. To date, however,
no general solution to this problem is known.


                         IV. CONCLUSIONS

  The problems in information retrieval mentioned here have certainly
been known to serious workers in the field for some time. Only
recently, however, have automatic data processing systems
become sufficiently available to be considered as possible tools in an
information retrieval system. Thus the SEAC experiments indicate the
practicality of using an ADPS for scanning a file of information at
high rates. However, many mechanisms considerably simpler than an ADPS
can also do such scanning, and the question remains open as to the
comparative advantages of an ADPS and more simple mechanisms for the
actual process of looking at a properly organized file. In some
retrieval situations, most notably in the Patent Office, the problem
is of sufficient magnitude and complexity that the power of an ADPS to
do more than just scan a file appears at first inspection to be a
requirement. Where the ADPS seems to offer a unique contribution is in
the off-line jobs. For such functions as preparing a search
prescription, editing a file, eliminating errors, transliterating from
one code to another, exploring complex logical conditions imposed on
the question and file, and probably many others, the ADPS offers the
outstanding virtues of high speed and great versatility. Thus it is
possible to use SEAC not only to test the utility of an ADPS for the
Patent Office retrieval problem but also to study the performance of
other devices by simulating them on SEAC. In the computing machine
field it is a well-known phenomenon that machine users discover many
new applications of these machines while in the process of using them.
It is hoped that the experiments on SEAC will serve a similar purpose.